library(foreign)
library(car)
## Loading required package: carData
library(nnet)
library(ggplot2)
library(reshape2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(caret)
## Loading required package: lattice
library(yardstick)
## For binary classification, the first factor level is assumed to be the event.
## Use the argument `event_level = "second"` to alter this as needed.
##
## Attaching package: 'yardstick'
## The following objects are masked from 'package:caret':
##
## precision, recall, sensitivity, specificity
In recent years, California has seen record-breaking numbers of wildfires. These wildfires harm wildlife and residents and carry serious environmental implications, driven by heightened CO2 levels, rising temperatures, and shifting precipitation. Previous exploratory analysis showed that lower soil moisture and rainfall led to larger fires between 1992 and 2015, and that lower soil moisture and rainfall were also associated with more wildfires caused by equipment use. It also found that higher rainfall increased the chance that a wildfire was caused by lightning. Government spending appeared to have only a marginal impact on suppressing wildfires in California.
With climate change an increasingly pressing issue, this research studies how wildfires in California have changed over time. Specifically, we focused on wildfires from 1992 to 2013 and investigated factors that may have significantly affected the intensity or volume of wildfires in California over the years.
The first dataset we found detailed 24 years of nationwide wildfires, from 1992 to 2015, along with their sizes, causes, times of occurrence, and other metadata. Analyzing every state's wildfire patterns would have been an unfocused and unfruitful effort, so we narrowed our scope to California, which experiences more wildfires than any other U.S. state.
However, we needed more data to conduct a more in-depth analysis, so we retrieved data from Cal-Adapt, an effort developed by the Geospatial Innovation Facility at the University of California, Berkeley to provide data portraying climate change in California. From Cal-Adapt we collected California's daily air temperature, soil moisture, and rainfall, and combined these with our wildfire data by aggregating monthly averages. We also added California's yearly budget to the combined dataset, as we were curious how it factored into the volume and intensity of wildfires over the years.
The initial exploratory data analysis made it clear that most fires in California coincided with high average air temperatures, low soil moisture, and periods of low rainfall. To identify factors that may drive future wildfires, the following questions need to be addressed:
To answer the first question, we used a correlation matrix to study the impact of temperature, soil moisture, and rainfall on fire size. The matrix showed a weak positive correlation between average air temperature and fire size, and strong negative correlations between fire size and soil moisture and between fire size and rainfall.
To answer the second question, we studied the frequency of the different fire size classes. California's wildfires are dominated by Class A and B fires. These fires are not considered a threat to the environment or nearby property, since they are typically self-limiting and die out quickly. Fires of Class C and above can be dangerous, as the burned area grows quickly in such cases.
To answer the third question, we plotted California's fire suppression budget and wildfire count over the years to see if there was anything noteworthy. Below, we can see that there is no strong relationship between the money spent controlling wildfires and the number of wildfires that occur each year.
Through EDA, we found that Class A and B fires occur with very high frequency in California; the dataset shows such fires being observed every day, and they are self-limiting and expected to die out quickly. Fires of Class C and above are observed far less frequently but can burn over a much larger area, threatening the environment, wildlife, and the people living nearby. They are considered dangerous because the burned area grows quickly, so this research focuses on these larger fires. Fires of Class C or above appear on 4410 days in California between 1992 and 2015. We build a logit regression model to predict the chance of a wildfire given average temperature, soil moisture, and rainfall.
In the dataset, we labeled days with no observed large fire as 0 and days with at least one as 1. During training, we tried several configurations of input features and found that adding rainfall did not improve on a model with temperature and soil moisture alone. The model with temperature and soil moisture achieved an AUC of 0.9017. Although this score is high, a model that requires the same day's readings to predict that day's fires is of limited use for forecasting; it would be more convenient to use one day's data to predict whether there will be any fires the next day. To increase the model's usability, we therefore adjusted the dataset so that each day's fire label is paired with the previous day's environmental readings. Trained on this adjusted dataset, the logit model with soil moisture and temperature achieved an AUC of 0.9002, close to the same-day model. Next, we simplified the model by reducing the features required for prediction: the temperature-only model has an AUC of 0.8799, which is still useful for predicting next-day wildfires with fewer data points. Concretely, over the 8036 days in the confusion matrices, the temperature-and-soil-moisture model makes 689 type II errors, while the temperature-only model makes 780.
library(dplyr)
data_wildfire=read.csv('final_wildfire.csv')
data_large=subset(data_wildfire, FIRE_SIZE_CLASS != 'A' & FIRE_SIZE_CLASS != 'B')
temp3=data_large %>% count(Year, DISCOVERY_DOY)
rainfall <- read.csv("rainfall_daily.csv")
soilmoisture <- read.csv("soilmoisture_daily.csv")
airtemp <- read.csv("airtemp_daily.csv")
airtemp$month <- strftime(airtemp$time, "%m")
airtemp$year <- strftime(airtemp$time, "%Y")
airtemp$DOY <- strftime(airtemp$time, "%j")
rainfall$year <- strftime(rainfall$time, "%Y")
rainfall$DOY <- strftime(rainfall$time, "%j")
cleaned_rainfall <- subset(rainfall, select = -c(time))
str(airtemp)
## 'data.frame': 23376 obs. of 5 variables:
## $ time : chr "1950-01-01 00:00:00+00:00" "1950-01-02 00:00:00+00:00" "1950-01-03 00:00:00+00:00" "1950-01-04 00:00:00+00:00" ...
## $ tair_day_livneh_vic: num 4.14 1.77 -3.09 -3.59 -2.75 ...
## $ month : chr "01" "01" "01" "01" ...
## $ year : chr "1950" "1950" "1950" "1950" ...
## $ DOY : chr "001" "002" "003" "004" ...
cleaned_airtemp <- subset(airtemp, select = -c(time))
soilmoisture$year <- strftime(soilmoisture$time, "%Y")
soilmoisture$DOY <- strftime(soilmoisture$time, "%j")
cleaned_soilmoisture <- subset(soilmoisture, select = -c(time))
joined1 <- merge(cleaned_airtemp, cleaned_soilmoisture, by.x=c("year",'DOY'), by.y=c("year",'DOY'))
joined2 <- merge(joined1, cleaned_rainfall, by.x=c("year",'DOY'), by.y=c("year",'DOY'))
joined2$year=as.integer(joined2$year)
joined2$DOY=as.integer(joined2$DOY)
str(joined2)
## 'data.frame': 23376 obs. of 6 variables:
## $ year : int 1950 1950 1950 1950 1950 1950 1950 1950 1950 1950 ...
## $ DOY : int 1 2 3 4 5 6 7 8 9 10 ...
## $ tair_day_livneh_vic : num 4.14 1.77 -3.09 -3.59 -2.75 ...
## $ month : chr "01" "01" "01" "01" ...
## $ soilmoist1_day_livneh_vic: num 18.3 18.5 18.3 18.3 18.1 ...
## $ rainfall_day_livneh_vic : num 0.67193 0.79325 0.07883 0.08991 0.00174 ...
joined2=subset(joined2, year>= 1992 & year<= 2015)
joined5=joined2
joined5$DOY=joined5$DOY + 1
data_large=merge(joined2,temp3, by.x=c('year','DOY'),by.y=c('Year','DISCOVERY_DOY'),all.x=T)
data_large$n[is.na(data_large$n)]=0
data_large$fire=1
data_large$fire[data_large$n==0]=0
str(data_large)
## 'data.frame': 8036 obs. of 8 variables:
## $ year : int 1992 1992 1992 1992 1992 1992 1992 1992 1992 1992 ...
## $ DOY : int 1 2 3 4 5 6 7 8 9 10 ...
## $ tair_day_livneh_vic : num 3.78 4.17 4.19 4.87 4.94 ...
## $ month : chr "01" "01" "01" "01" ...
## $ soilmoist1_day_livneh_vic: num 20.2 20.5 21.4 23.6 25.1 ...
## $ rainfall_day_livneh_vic : num 0.0616 1.1337 3.0754 7.0797 10.8035 ...
## $ n : num 0 0 0 0 0 0 0 0 0 0 ...
## $ fire : num 0 0 0 0 0 0 0 0 0 0 ...
data_large2=merge(joined5,temp3, by.x=c('year','DOY'),by.y=c('Year','DISCOVERY_DOY'),all.x=T)
data_large2$n[is.na(data_large2$n)]=0
data_large2$fire=1
data_large2$fire[data_large2$n==0]=0
for (i in 1992:2015){  # range() returns only the endpoints; `:` gives the full sequence
  tdate=max(subset(data_large2,year==i)$DOY)
  idx = data_large2$year == i & data_large2$DOY == tdate
  data_large2$year[idx]=i+1  # roll the final shifted day over into the next year
  data_large2$DOY[idx]=1
}
## Warning in max(subset(data_large2, year == i)$DOY): no non-missing arguments to
## max; returning -Inf
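As a more robust alternative to the loop above, the one-day shift can be done with R's Date arithmetic, which handles month, year, and leap-year boundaries automatically. This is a sketch, assuming `year` and `DOY` columns like those in `joined2`:

```r
# Convert (year, day-of-year) to a Date, add one day, and convert back.
# Date arithmetic rolls over month/year boundaries and leap years for us.
shift_one_day <- function(year, doy) {
  d <- as.Date(paste(year, doy), format = "%Y %j") + 1
  list(year = as.integer(format(d, "%Y")),
       doy  = as.integer(format(d, "%j")))
}

# Example: Dec 31, 1992 (a leap year, DOY 366) rolls over to Jan 1, 1993.
shift_one_day(1992, 366)
```

Applied to the whole data frame, this replaces both the `DOY + 1` step and the year-boundary loop.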
str(data_large2)
## 'data.frame': 8036 obs. of 8 variables:
## $ year : num 1992 1992 1992 1992 1992 ...
## $ DOY : num 2 3 4 5 6 7 8 9 10 11 ...
## $ tair_day_livneh_vic : num 3.78 4.17 4.19 4.87 4.94 ...
## $ month : chr "01" "01" "01" "01" ...
## $ soilmoist1_day_livneh_vic: num 20.2 20.5 21.4 23.6 25.1 ...
## $ rainfall_day_livneh_vic : num 0.0616 1.1337 3.0754 7.0797 10.8035 ...
## $ n : num 0 0 0 0 0 0 0 0 0 0 ...
## $ fire : num 0 0 0 0 0 0 0 0 0 0 ...
library(ezids)  # provides xkabledply()
model_large=glm(fire~tair_day_livneh_vic+soilmoist1_day_livneh_vic,data=data_large,family=binomial())
#summary(model_large)
xkabledply(model_large, title = paste("Logit Regression :",format(formula(model_large)) ))
| | Estimate | Std. Error | z value | Pr(>|z|) |
|---|---|---|---|---|
| (Intercept) | 3.0691 | 0.3117 | 9.8458 | 0 |
| tair_day_livneh_vic | 0.1461 | 0.0080 | 18.1739 | 0 |
| soilmoist1_day_livneh_vic | -0.3157 | 0.0145 | -21.7253 | 0 |
library('pROC')
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
prob=predict(model_large, type = "response" )
data_large$prob=prob
h = roc(fire~prob, data=data_large)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc(h)
## Area under the curve: 0.9017
plot(h)
library("regclass")
## Loading required package: bestglm
## Loading required package: leaps
## Loading required package: VGAM
## Loading required package: stats4
## Loading required package: splines
##
## Attaching package: 'VGAM'
## The following object is masked from 'package:caret':
##
## predictors
## The following object is masked from 'package:car':
##
## logit
## Loading required package: rpart
## Loading required package: randomForest
## randomForest 4.7-1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
## Important regclass change from 1.3:
## All functions that had a . in the name now have an _
## all.correlations -> all_correlations, cor.demo -> cor_demo, etc.
##
## Attaching package: 'regclass'
## The following object is masked from 'package:lattice':
##
## qq
xkabledply( confusion_matrix(model_large), title = "Confusion matrix for large fire predictor" )
| | Predicted 0 | Predicted 1 | Total |
|---|---|---|---|
| Actual 0 | 3265 | 659 | 3924 |
| Actual 1 | 692 | 3420 | 4112 |
| Total | 3957 | 4079 | 8036 |
To increase the model's usability, we next build a model that predicts the probability of a large fire on the following day:
model_large=glm(fire~tair_day_livneh_vic+soilmoist1_day_livneh_vic+rainfall_day_livneh_vic,data=data_large2,family=binomial())
#summary(model_large)
xkabledply(model_large, title = paste("Predicting Probability :", format(formula(model_large))))
## Warning in if (wtitle == "") {: the condition has length > 1 and only the first
## element will be used
| | Estimate | Std. Error | z value | Pr(>|z|) |
|---|---|---|---|---|
| (Intercept) | 2.4707 | 0.3245 | 7.6148 | 0 |
| tair_day_livneh_vic | 0.1508 | 0.0081 | 18.6813 | 0 |
| soilmoist1_day_livneh_vic | -0.2743 | 0.0158 | -17.3927 | 0 |
| rainfall_day_livneh_vic | -0.1312 | 0.0256 | -5.1318 | 0 |
library('pROC')
prob=predict(model_large, type = "response" )
data_large2$prob=prob
h = roc(fire~prob, data=data_large2)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc(h)
## Area under the curve: 0.9002
plot(h)
library("regclass")
library("ezids")
xkabledply( confusion_matrix(model_large), title = "Confusion matrix for large fire predictor" )
| | Predicted 0 | Predicted 1 | Total |
|---|---|---|---|
| Actual 0 | 3244 | 680 | 3924 |
| Actual 1 | 689 | 3423 | 4112 |
| Total | 3933 | 4103 | 8036 |
model_large=glm(fire~tair_day_livneh_vic,data=data_large2,family=binomial())
#summary(model_large)
xkabledply(model_large, title = paste("Logit Regression :", format(formula(model_large))))
| | Estimate | Std. Error | z value | Pr(>|z|) |
|---|---|---|---|---|
| (Intercept) | -3.7380 | 0.0821 | -45.5091 | 0 |
| tair_day_livneh_vic | 0.2833 | 0.0058 | 48.6031 | 0 |
library('pROC')
prob=predict(model_large, type = "response" )
data_large2$prob=prob
h = roc(fire~prob, data=data_large2)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc(h)
## Area under the curve: 0.8799
plot(h)
library("regclass")
library("ezids")
xkabledply( confusion_matrix(model_large), title = "Confusion matrix for large fire predictor" )
| | Predicted 0 | Predicted 1 | Total |
|---|---|---|---|
| Actual 0 | 3222 | 702 | 3924 |
| Actual 1 | 780 | 3332 | 4112 |
| Total | 4002 | 4034 | 8036 |
These results suggest that, with today's temperature, rainfall, and soil moisture data, we achieve an AUC of 0.90 for predicting a large fire the next day. If we instead build a more convenient model that requires only temperature, the AUC is 0.88.
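To illustrate how such a next-day model could be used operationally, here is a minimal sketch on simulated data standing in for the real dataset (the variable names `temp` and `soil` and all numeric values are illustrative, not the fitted coefficients above):

```r
set.seed(42)
# Simulated stand-in for the training data: fire risk rises with
# temperature and falls with soil moisture, as in the report's models.
n    <- 500
temp <- runif(n, -5, 35)
soil <- runif(n, 5, 30)
fire <- rbinom(n, 1, plogis(0.15 * temp - 0.3 * soil + 3))

fit <- glm(fire ~ temp + soil, family = binomial())

# Given today's readings, estimate the probability of a large fire tomorrow.
today <- data.frame(temp = 28, soil = 8)
p <- predict(fit, newdata = today, type = "response")
```

The same `predict(..., newdata = ..., type = "response")` call applied to the fitted `model_large` with a one-row data frame of today's readings yields tomorrow's fire probability.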
Next, we try to determine whether we can predict the specific fire size class of a large wildfire (Class C or above).
data_wildfire$STAT_CAUSE_CODE <- as.factor(data_wildfire$STAT_CAUSE_CODE)
data_wildfire$FIRE_SIZE_CLASS <- as.factor(data_wildfire$FIRE_SIZE_CLASS)
temp8= subset(data_wildfire, FIRE_SIZE_CLASS != 'A' & FIRE_SIZE_CLASS != 'B')
temp8$FIRE_SIZE_CLASS <- droplevels(temp8$FIRE_SIZE_CLASS)  # drop the now-empty A/B levels so downstream tables align
split <- createDataPartition(temp8$FIRE_SIZE_CLASS, p = .70, list = FALSE)
## Warning in createDataPartition(temp8$FIRE_SIZE_CLASS, p = 0.7, list = FALSE):
## Some classes have no records ( A, B ) and these will be ignored
train <- temp8[split,]
test <- temp8[-split,]
With our training data, we achieve an accuracy of 50.9% for predicting the fire size class of a wildfire based on that month's conditions. The confusion matrix suggests that the model is biased towards predicting Class C, which makes sense because, among large fires, the smaller classes are much more frequent than the larger ones.
model1 <- multinom(FIRE_SIZE_CLASS ~ tair_day_livneh_vic +
soilmoist1_day_livneh_vic +
rainfall_day_livneh_vic, data = train)
## Warning in multinom(FIRE_SIZE_CLASS ~ tair_day_livneh_vic +
## soilmoist1_day_livneh_vic + : groups 'A' 'B' are empty
## # weights: 25 (16 variable)
## initial value 15344.381057
## iter 10 value 9811.159474
## iter 20 value 9431.462254
## final value 9431.230655
## converged
summary(model1)
## Call:
## multinom(formula = FIRE_SIZE_CLASS ~ tair_day_livneh_vic + soilmoist1_day_livneh_vic +
## rainfall_day_livneh_vic, data = train)
##
## Coefficients:
## (Intercept) tair_day_livneh_vic soilmoist1_day_livneh_vic
## D -1.993649 0.01593043 0.009748272
## E -2.368006 0.02640819 -0.025071562
## F -4.536076 0.06298458 0.055103167
## G -3.482232 0.08401695 -0.127661125
## rainfall_day_livneh_vic
## D 0.045647775
## E 0.099442850
## F -0.006113015
## G 0.010006781
##
## Std. Errors:
## (Intercept) tair_day_livneh_vic soilmoist1_day_livneh_vic
## D 0.4347072 0.009027252 0.02342523
## E 0.5731534 0.011878893 0.03124641
## F 0.7292202 0.015333614 0.03895641
## G 1.1418168 0.023236312 0.06674467
## rainfall_day_livneh_vic
## D 0.04289798
## E 0.05069511
## F 0.07867477
## G 0.15289228
##
## Residual Deviance: 18862.46
## AIC: 18894.46
train$predictions <- predict(model1, newdata = train, "class")
cm <- confusionMatrix(train$predictions, train$FIRE_SIZE_CLASS)
## Warning in levels(reference) != levels(data): longer object length is not a
## multiple of shorter object length
## Warning in confusionMatrix.default(train$predictions, train$FIRE_SIZE_CLASS):
## Levels are not in the same order for reference and data. Refactoring data to
## match.
cmdf <- as.data.frame(cm$table)
cmdf$Prediction <- factor(cmdf$Prediction, levels=rev(levels(cmdf$Prediction)))
tab <- table(train$FIRE_SIZE_CLASS, train$predictions)
round((sum(diag(tab))/sum(tab))*100,2)
## [1] 0
ggplot(cmdf, aes(Reference,Prediction, fill= Freq)) +
geom_tile() + geom_text(aes(label=Freq)) +
scale_fill_gradient(low="white", high="#009194") +
labs(x = "Reference",y = "Prediction", title="Confusion Matrix for Predicted Fire Size")
test$predictions <- predict(model1, newdata = test, "class")
tab <- table(test$FIRE_SIZE_CLASS, test$predictions)
round((sum(diag(tab))/sum(tab))*100,2)
## [1] 0
cm <- confusionMatrix(test$predictions, test$FIRE_SIZE_CLASS)
## Warning in levels(reference) != levels(data): longer object length is not a
## multiple of shorter object length
## Warning in confusionMatrix.default(test$predictions, test$FIRE_SIZE_CLASS):
## Levels are not in the same order for reference and data. Refactoring data to
## match.
cmdf <- as.data.frame(cm$table)
cmdf$Prediction <- factor(cmdf$Prediction, levels=rev(levels(cmdf$Prediction)))
ggplot(cmdf, aes(Reference,Prediction, fill= Freq)) +
geom_tile() + geom_text(aes(label=Freq)) +
scale_fill_gradient(low="white", high="#009194") +
labs(x = "Reference",y = "Prediction", title="Confusion Matrix for Predicted Fire Size")
We found that because the large-wildfire dataset is imbalanced, the model fails to discriminate between classes: it predicts nearly everything as Class C, the majority class. This suggests we would need more data, or an alternative model, to predict these categories.
With our training data, the model assigned almost all labels to the majority class, so it could not usefully separate the size classes. We conclude that this model is not usable as-is.
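One inexpensive remedy worth trying before collecting more data is to rebalance the training set. This is a sketch of simple down-sampling of the majority classes using base R; the class labels and counts here are illustrative, not the real distribution:

```r
set.seed(1)
# Illustrative imbalanced training frame: class "C" dominates, as in our data.
train_df <- data.frame(
  size_class = factor(rep(c("C", "D", "E"), times = c(900, 70, 30))),
  x = rnorm(1000)
)

# Down-sample every class to the size of the rarest one.
min_n    <- min(table(train_df$size_class))
balanced <- do.call(rbind, lapply(split(train_df, train_df$size_class),
                                  function(d) d[sample(nrow(d), min_n), ]))

table(balanced$size_class)  # each class now has the same number of rows
```

`caret` (already loaded above) offers `downSample()`/`upSample()` helpers for the same purpose; fitting `multinom` on the rebalanced frame would at least force the model to consider the rarer classes.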
We also wanted to evaluate the long-term effect of environmental factors on wildfires. We tried to predict the number of wildfires, the average fire size, and the total burned area from these factors.
temp=aggregate(tair_day_livneh_vic~Year+month,data=data_wildfire,mean)
temp1=aggregate(soilmoist1_day_livneh_vic~Year+month,data=data_wildfire,mean)
temp2=aggregate(rainfall_day_livneh_vic~Year+month,data=data_wildfire,mean)
joined3=merge(temp,temp1,by.x=c('Year','month'),by.y=c('Year','month'))
joined3=merge(joined3,temp2,by.x=c('Year','month'),by.y=c('Year','month'))
temp1=data_wildfire %>% count(Year, month)
temp2=aggregate(FIRE_SIZE~Year+month,data=data_wildfire,mean)
joined4 = merge(temp2, temp1, by.x=c('Year', 'month'), by.y=c('Year', 'month'))
joined4 = merge(joined3, joined4, by.x=c('Year','month'), by.y=c('Year', 'month'))
library(corrplot)
## corrplot 0.92 loaded
joined4$totalarea=joined4$n*joined4$FIRE_SIZE
cor_fire=cor(joined4[c(3,4,5,6,7,8)])
corrplot(cor_fire,method='number',type = 'lower', diag = TRUE)
From the corrplot, we can see that the number of fires has a strong correlation with the environmental variables, and that fire size and total burned area have some correlation with them. We first fit linear models to these response variables directly, but the results suggested a poor fit; after log-transforming the response variables, the models improved.
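The benefit of the log transform can be seen on simulated data: when a response is multiplicative (log-linear) in the predictor, as counts and burned areas often are, a linear fit on the log scale explains far more variance. This is a sketch on synthetic data, not the report's dataset:

```r
set.seed(7)
# Simulate a response that grows exponentially with the predictor.
x <- runif(200, 0, 10)
y <- exp(0.5 * x + rnorm(200, sd = 0.4))

# Compare adjusted R-squared for the raw and log-transformed responses.
r2_raw <- summary(lm(y ~ x))$adj.r.squared
r2_log <- summary(lm(log(y) ~ x))$adj.r.squared
c(raw = r2_raw, log = r2_log)  # the log model fits markedly better
```

The same comparison on `n`, `FIRE_SIZE`, and `totalarea` motivates the `log(...)` responses in the models below.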
#Total Fire Cases
model_case=lm(log(n)~tair_day_livneh_vic+soilmoist1_day_livneh_vic,data=joined4)
#summary(model_case)
xkabledply(model_case, title = paste("Linear Regression :", format(formula(model_case))))
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 8.8261 | 0.2806 | 31.4571 | 0 |
| tair_day_livneh_vic | 0.0526 | 0.0068 | 7.7415 | 0 |
| soilmoist1_day_livneh_vic | -0.2322 | 0.0126 | -18.4468 | 0 |
VIF(model_case)
## tair_day_livneh_vic soilmoist1_day_livneh_vic
## 3.653691 3.653691
plot(model_case)
#Average Fire Size
model_avgsize=lm(log(FIRE_SIZE)~tair_day_livneh_vic+soilmoist1_day_livneh_vic,data=joined4)
#summary(model_avgsize)
xkabledply(model_avgsize, title = paste("Linear Regression :", format(formula(model_avgsize))))
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 6.4512 | 0.8955 | 7.2041 | 0.0000 |
| tair_day_livneh_vic | 0.0409 | 0.0217 | 1.8858 | 0.0604 |
| soilmoist1_day_livneh_vic | -0.2994 | 0.0402 | -7.4540 | 0.0000 |
VIF(model_avgsize)
## tair_day_livneh_vic soilmoist1_day_livneh_vic
## 3.653691 3.653691
plot(model_avgsize)
#Total Fire Area
model_totalarea=lm(log(totalarea)~tair_day_livneh_vic+soilmoist1_day_livneh_vic,data=joined4)
#summary(model_totalarea)
xkabledply(model_totalarea, title = paste("Linear Regression :", format(formula(model_totalarea))))
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 15.2773 | 0.9527 | 16.0351 | 0e+00 |
| tair_day_livneh_vic | 0.0936 | 0.0231 | 4.0523 | 1e-04 |
| soilmoist1_day_livneh_vic | -0.5316 | 0.0427 | -12.4385 | 0e+00 |
VIF(model_totalarea)
## tair_day_livneh_vic soilmoist1_day_livneh_vic
## 3.653691 3.653691
plot(model_totalarea)
The R-squared value is 0.9 for the fire-count model, 0.5377 for the average-fire-size model, and 0.7825 for the total-burned-area model. Checking the diagnostic plots, we found that the fire-count and burned-area models satisfy the linear model assumptions well.
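Since the `summary()` calls are commented out above, the R-squared values quoted here can instead be pulled programmatically from the fit object. A sketch on a built-in dataset standing in for `joined4`:

```r
# summary() of an lm fit is a list; the R-squared values are elements of it.
fit <- lm(mpg ~ wt + hp, data = mtcars)  # built-in data as a stand-in
s   <- summary(fit)
s$r.squared      # multiple R-squared
s$adj.r.squared  # adjusted R-squared (always <= multiple R-squared)
```

The same `summary(model_case)$adj.r.squared` idiom retrieves the values reported for each model above.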
temp=aggregate(tair_day_livneh_vic~Year+month,data=data_wildfire,mean)
temp1=aggregate(soilmoist1_day_livneh_vic~Year+month,data=data_wildfire,mean)
temp2=aggregate(rainfall_day_livneh_vic~Year+month,data=data_wildfire,mean)
joined3=merge(temp,temp1,by.x=c('Year','month'),by.y=c('Year','month'))
joined3=merge(joined3,temp2,by.x=c('Year','month'),by.y=c('Year','month'))
temp1=subset(data_wildfire, FIRE_SIZE_CLASS != 'A' & FIRE_SIZE_CLASS != 'B')
temp2=aggregate(FIRE_SIZE~Year+month,data=temp1,mean)
temp1=temp1 %>% count(Year, month)
joined4 = merge(temp2, temp1, by.x=c('Year', 'month'), by.y=c('Year', 'month'))
joined4 = merge(joined3, joined4, by.x=c('Year','month'), by.y=c('Year', 'month'))
We build models to predict the number of large wildfires, their average fire size, and the total burned area.
library(corrplot)
joined4$totalarea=joined4$n*joined4$FIRE_SIZE
cor_fire=cor(joined4[c(3,4,5,6,7,8)])
corrplot(cor_fire,method='number',type = 'lower', diag = TRUE)
model_case=lm(log(n)~tair_day_livneh_vic+soilmoist1_day_livneh_vic,data=joined4)
#summary(model_case)
xkabledply(model_case, title = paste("Case Model:",format(formula(model_case)) ))
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 6.1644 | 0.4773 | 12.9163 | 0 |
| tair_day_livneh_vic | 0.0904 | 0.0112 | 8.0897 | 0 |
| soilmoist1_day_livneh_vic | -0.2847 | 0.0219 | -12.9789 | 0 |
plot(model_case)
VIF(model_case)
## tair_day_livneh_vic soilmoist1_day_livneh_vic
## 3.733188 3.733188
model_avgsize=lm(log(FIRE_SIZE)~tair_day_livneh_vic+soilmoist1_day_livneh_vic,data=joined4)
#summary(model_avgsize)
xkabledply(model_avgsize, title = paste("Avg. Size Model:",format(formula(model_avgsize)) ))
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 7.4492 | 0.9322 | 7.9908 | 0.0000 |
| tair_day_livneh_vic | 0.0407 | 0.0218 | 1.8642 | 0.0635 |
| soilmoist1_day_livneh_vic | -0.1770 | 0.0428 | -4.1300 | 0.0000 |
plot(model_avgsize)
VIF(model_avgsize)
## tair_day_livneh_vic soilmoist1_day_livneh_vic
## 3.733188 3.733188
model_totalarea=lm(log(totalarea)~tair_day_livneh_vic+soilmoist1_day_livneh_vic,data=joined4)
#summary(model_totalarea)
xkabledply(model_totalarea, title = paste("Burn Area Model:",format(formula(model_totalarea)) ))
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 13.6135 | 1.1019 | 12.3546 | 0 |
| tair_day_livneh_vic | 0.1310 | 0.0258 | 5.0809 | 0 |
| soilmoist1_day_livneh_vic | -0.4616 | 0.0506 | -9.1154 | 0 |
plot(model_totalarea)
VIF(model_totalarea)
## tair_day_livneh_vic soilmoist1_day_livneh_vic
## 3.733188 3.733188
For these models, the R-squared value is 0.8616 for the large-fire-count model, 0.3366 for the average-fire-size model, and 0.7391 for the total-burned-area model. The diagnostic plots show that the large-fire-count model does not fit well, as the residuals are not homoscedastic, and the Q-Q plot of the average-large-fire-size model shows some departure from normality.
First, linear modeling was used to predict the total number of fires in California, the fire size, and the burn area with respect to the environmental factors as the independent variables. These models were also looked at individually for naturally caused wildfires and those caused by people to see if there is a difference in the wildfire’s impact.
summary_nature=read.csv('summary_nature.csv')
summary_peoplecaused=read.csv('summary_peoplecaused.csv')
#Renaming variables inside summary_nature and summary_peoplecaused.
#Note: the columns must be renamed within each data frame; assigning them to
#free-standing variables would make both models silently share one set of values.
summary_nature <- rename(summary_nature,
                         Avg_Temp = tair_day_livneh_vic,
                         Avg_SoilMoisture = soilmoist1_day_livneh_vic,
                         Avg_Rainfall = rainfall_day_livneh_vic)
summary_peoplecaused <- rename(summary_peoplecaused,
                               Avg_Temp = tair_day_livneh_vic,
                               Avg_SoilMoisture = soilmoist1_day_livneh_vic,
                               Avg_Rainfall = rainfall_day_livneh_vic)
model_nnature <- lm(log(n)~ Avg_Temp + Avg_SoilMoisture, data = summary_nature)
#summary(model_nnature)
xkabledply(model_nnature, title = paste("Nature Caused:",format(formula(model_nnature)) ))
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 8.1533 | 0.3724 | 21.8919 | 0 |
| Avg_Temp | 0.0595 | 0.0090 | 6.5931 | 0 |
| Avg_SoilMoisture | -0.2529 | 0.0167 | -15.1381 | 0 |
plot(model_nnature)
## Logn ~ nature variables - PEOPLE CAUSED (P)
model_npeople <- lm(log(n)~ Avg_Temp + Avg_SoilMoisture, data = summary_peoplecaused)
#summary(model_npeople)
xkabledply(model_npeople, title = paste("People Caused:",format(formula(model_npeople)) ))
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 7.2814 | 0.3131 | 23.2584 | 0 |
| Avg_Temp | 0.0526 | 0.0076 | 6.9257 | 0 |
| Avg_SoilMoisture | -0.2251 | 0.0140 | -16.0264 | 0 |
plot(model_npeople)
The first model, with the total number of fires as the dependent variable, had an adjusted R-squared value of 87% when the fires were naturally caused; in other words, the environmental factors can explain 87% of the total variation in the number of naturally caused fires. The R-squared value was % for the model where the fires were man-made. For both models, the coefficients showed a direct relationship between air temperature and the number of fires and an inverse relationship between soil moisture and the number of fires. Rainfall, where examined, was also inversely related to the number of fires.
## FIRE_SIZE ~ nature variables - NATURE CAUSED (N)
model_sizenature <- lm(log(FIRE_SIZE) ~ Avg_Temp + Avg_SoilMoisture , data = summary_nature)
#summary(model_sizenature)
xkabledply(model_sizenature, title = paste("Nature Caused:",format(formula(model_sizenature)) ))
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 7.5705 | 1.0788 | 7.0177 | 0.0000 |
| Avg_Temp | 0.0291 | 0.0261 | 1.1144 | 0.2661 |
| Avg_SoilMoisture | -0.3783 | 0.0484 | -7.8172 | 0.0000 |
plot(model_sizenature)
model_sizepeople <- lm(log(FIRE_SIZE) ~ Avg_Temp + Avg_SoilMoisture , data = summary_peoplecaused)
#summary(model_sizepeople)
xkabledply(model_sizepeople, title = paste("People Caused:",format(formula(model_sizepeople) )))
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 4.8934 | 1.1114 | 4.4029 | 0.0000 |
| Avg_Temp | 0.0756 | 0.0269 | 2.8048 | 0.0054 |
| Avg_SoilMoisture | -0.2687 | 0.0499 | -5.3907 | 0.0000 |
plot(model_sizepeople)
Next, we modeled the impact of the environmental factors on fire size. The environmental factors can explain % of the variation in fire size in the naturally caused fires model, whereas only % of the variation is explained in the people-caused fires model. The independent variables related to fire size in the same directions as they did to the number of fires.
model_areanature <- lm(log(FIRE_SIZE*n) ~ Avg_Temp + Avg_SoilMoisture , data = summary_nature)
#summary(model_areanature)
xkabledply(model_areanature, title = paste("Nature Caused:",format(formula(model_areanature)) ))
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 15.7238 | 1.1627 | 13.5237 | 0.0000 |
| Avg_Temp | 0.0887 | 0.0282 | 3.1459 | 0.0018 |
| Avg_SoilMoisture | -0.6312 | 0.0522 | -12.1022 | 0.0000 |
plot(model_areanature)
model_areapeople <- lm(log(FIRE_SIZE*n) ~ Avg_Temp + Avg_SoilMoisture , data = summary_peoplecaused)
#summary(model_areapeople)
xkabledply(model_areapeople, title = paste("People Caused:",format(formula(model_areapeople) )))
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 12.1748 | 1.2004 | 10.1423 | 0 |
| Avg_Temp | 0.1281 | 0.0291 | 4.4031 | 0 |
| Avg_SoilMoisture | -0.4938 | 0.0538 | -9.1707 | 0 |
plot(model_areapeople)
Finally, we looked at how the environmental factors impact the burned area. We found that % of the variation in burn size was explained in the naturally caused fires model and % in the people-caused fires model. Again, the relationships between the variables stayed consistent with the previous models.
Looking at the difference in impact by cause, the linear models showed that regardless of whether a wildfire started naturally or was man-made, the relationships between the variables remained consistent, suggesting that both kinds of wildfires are similarly devastating.
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ tibble 3.1.6 ✓ purrr 0.3.4
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.1.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x randomForest::combine() masks dplyr::combine()
## x tidyr::fill() masks VGAM::fill()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x purrr::lift() masks caret::lift()
## x randomForest::margin() masks ggplot2::margin()
## x dplyr::recode() masks car::recode()
## x purrr::some() masks car::some()
## x readr::spec() masks yardstick::spec()
## CART
##
## 151640 samples
## 4 predictor
## 7 classes: 'A', 'B', 'C', 'D', 'E', 'F', 'G'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 121312, 121313, 121313, 121312, 121310
## Resampling results across tuning parameters:
##
## maxdepth Accuracy Kappa
## 1 0.5242878 0.07930881
## 3 0.5242878 0.07930881
## 5 0.5242878 0.07930881
## 7 0.5242878 0.07930881
## 9 0.5242878 0.07930881
## 10 0.5242878 0.07930881
## 11 0.5242878 0.07930881
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was maxdepth = 1.
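The tuning summary above is consistent with caret's `train()` using the `rpart2` method, which tunes `maxdepth` under k-fold cross-validation. A hypothetical reconstruction of the call (`iris` stands in for the fire data; the actual predictors may differ):

```r
library(caret)  # assumes caret and rpart are installed
set.seed(123)
ctrl <- trainControl(method = "cv", number = 5)  # 5-fold CV, as above
fit <- train(Species ~ ., data = iris, method = "rpart2",
             trControl = ctrl,
             tuneGrid = data.frame(maxdepth = c(1, 3, 5, 7, 9, 10, 11)))
fit$bestTune  # depth chosen by the largest cross-validated accuracy
```

When accuracy is flat across the grid, as in the output above, caret simply keeps the smallest depth, which is why the final model used `maxdepth = 1`.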
## Warning: extra=106 but the response has 7 levels (only the 2nd level is
## displayed)
## predict_unseen
## A B C D E F G
## A 20041 1827 0 0 0 0 0
## B 12300 1594 0 0 0 0 0
## C 1244 215 0 0 0 0 0
## D 300 43 0 0 0 0 0
## E 135 23 0 0 0 0 0
## F 96 9 0 0 0 0 0
## G 75 8 0 0 0 0 0
## [1] "Accuracy for test 0.570693748351358"
## [1] 0.5732524
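The printed test accuracy is simply the proportion of correct predictions: the diagonal of the confusion table over the total count. Rebuilding the `predict_unseen` table above in base R reproduces the figure:

```r
# Rows are actual classes, columns are predicted classes, as printed above.
cm <- matrix(0, 7, 7, dimnames = list(LETTERS[1:7], LETTERS[1:7]))
cm["A", c("A", "B")] <- c(20041, 1827)
cm["B", c("A", "B")] <- c(12300, 1594)
cm["C", c("A", "B")] <- c(1244, 215)
cm["D", c("A", "B")] <- c(300, 43)
cm["E", c("A", "B")] <- c(135, 23)
cm["F", c("A", "B")] <- c(96, 9)
cm["G", c("A", "B")] <- c(75, 8)
sum(diag(cm)) / sum(cm)  # matches the printed 0.570693748351358
```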
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E F G
## A 20041 12300 1244 300 135 96 75
## B 1827 1594 215 43 23 9 8
## C 0 0 0 0 0 0 0
## D 0 0 0 0 0 0 0
## E 0 0 0 0 0 0 0
## F 0 0 0 0 0 0 0
## G 0 0 0 0 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.5707
## 95% CI : (0.5657, 0.5757)
## No Information Rate : 0.5768
## P-Value [Acc > NIR] : 0.9924
##
## Kappa : 0.0326
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E Class: F
## Sensitivity 0.9165 0.11473 0.00000 0.000000 0.000000 0.00000
## Specificity 0.1179 0.91152 1.00000 1.000000 1.000000 1.00000
## Pos Pred Value 0.5861 0.42861 NaN NaN NaN NaN
## Neg Pred Value 0.5087 0.64026 0.96151 0.990952 0.995832 0.99723
## Prevalence 0.5768 0.36650 0.03849 0.009048 0.004168 0.00277
## Detection Rate 0.5286 0.04205 0.00000 0.000000 0.000000 0.00000
## Detection Prevalence 0.9019 0.09810 0.00000 0.000000 0.000000 0.00000
## Balanced Accuracy 0.5172 0.51312 0.50000 0.500000 0.500000 0.50000
## Class: G
## Sensitivity 0.000000
## Specificity 1.000000
## Pos Pred Value NaN
## Neg Pred Value 0.997811
## Prevalence 0.002189
## Detection Rate 0.000000
## Detection Prevalence 0.000000
## Balanced Accuracy 0.500000
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E F G
## A 20272 12434 1254 304 137 96 75
## B 1596 1460 205 39 21 9 8
## C 0 0 0 0 0 0 0
## D 0 0 0 0 0 0 0
## E 0 0 0 0 0 0 0
## F 0 0 0 0 0 0 0
## G 0 0 0 0 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.5733
## 95% CI : (0.5683, 0.5782)
## No Information Rate : 0.5768
## P-Value [Acc > NIR] : 0.922
##
## Kappa : 0.0338
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E Class: F
## Sensitivity 0.9270 0.10508 0.00000 0.000000 0.000000 0.00000
## Specificity 0.1086 0.92180 1.00000 1.000000 1.000000 1.00000
## Pos Pred Value 0.5864 0.43739 NaN NaN NaN NaN
## Neg Pred Value 0.5219 0.64034 0.96151 0.990952 0.995832 0.99723
## Prevalence 0.5768 0.36650 0.03849 0.009048 0.004168 0.00277
## Detection Rate 0.5347 0.03851 0.00000 0.000000 0.000000 0.00000
## Detection Prevalence 0.9119 0.08805 0.00000 0.000000 0.000000 0.00000
## Balanced Accuracy 0.5178 0.51344 0.50000 0.500000 0.500000 0.50000
## Class: G
## Sensitivity 0.000000
## Specificity 1.000000
## Pos Pred Value NaN
## Neg Pred Value 0.997811
## Prevalence 0.002189
## Detection Rate 0.000000
## Detection Prevalence 0.000000
## Balanced Accuracy 0.500000
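One caution when reading these results: the No Information Rate (0.5768) is the prevalence of the majority class A, and neither model's accuracy (0.5707 and 0.5733, with p-values of 0.9924 and 0.922) beats it, so the tree performs no better than always predicting class A. The NIR can be recovered directly from the reference column totals in the second confusion matrix:

```r
# Class A reference count over the total count, from the table above.
nir <- (20272 + 1596) / (20272 + 1596 + 12434 + 1460 + 1254 + 205 +
                         304 + 39 + 137 + 21 + 96 + 9 + 75 + 8)
round(nir, 4)  # 0.5768, the reported No Information Rate
```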
Through basic exploratory data analysis and modeling, this research found a strong relationship between environmental factors and wildfires, their burn area, and fire size. The linear modeling shows that regardless of the cause of a fire, its impact remains equally devastating.
The raw data on the environmental factors were recorded at a daily frequency. To conduct the research, the data were aggregated to monthly and yearly averages. To make more accurate predictions about wildfires, it would have been beneficial to use the exact temperature, soil moisture, and rainfall readings for each wildfire instead of a monthly or yearly average. In addition, our wildfire data only went up to 2015; it would have been useful to see how frequent and intense fires from the past 7 years have been, especially with the acceleration of global temperatures.
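The daily-to-monthly conversion described above can be sketched in base R (hypothetical column names; the real data has more variables):

```r
# Average a daily environmental series within each year-month bucket.
set.seed(1)
daily <- data.frame(date = seq(as.Date("1992-01-01"), by = "day", length.out = 60),
                    temp = runif(60, 10, 30))
daily$month <- format(daily$date, "%Y-%m")
monthly <- aggregate(temp ~ month, data = daily, FUN = mean)
monthly  # one averaged row per month
```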
Through our EDA and modeling, we saw that we were dealing with highly imbalanced data, in that most of the fires recorded by the FPA (Fire Program Analysis) system fell into only a few categories. In particular, 48% of fires came from 3 of the 13 causes in the dataset (debris burning, lightning, and miscellaneous reasons). In addition, just over 90% of wildfires fell into classes A and B, meaning the vast majority of recorded fires burned between 0 and 9.9 acres. We would either need to collect more data on larger-class fires or rebalance our data so that our models can more accurately predict wildfire causes and sizes.
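One way to address the imbalance, sketched here with a deliberately skewed stand-in dataset: downsample every size class to the count of the rarest class before training (caret's `downSample()` offers a similar shortcut).

```r
set.seed(1)
downsample_classes <- function(df, class_col) {
  n_min <- min(table(df[[class_col]]))  # size of the rarest class
  do.call(rbind, lapply(split(df, df[[class_col]]),
                        function(g) g[sample(nrow(g), n_min), , drop = FALSE]))
}
# Deliberately imbalanced stand-in for the fire data: 50 / 20 / 10 rows per class.
skewed <- iris[c(1:50, 51:70, 101:110), ]
balanced <- downsample_classes(skewed, "Species")
table(balanced$Species)  # each class downsampled to 10
```

Downsampling discards majority-class rows, so with this much skew it would shrink the training set considerably; upsampling the rare classes or class weights are alternatives worth comparing.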
In the future, it would also be interesting to take a deeper look into Class C fires, studying the factors that cause larger fires and how those factors affect their burn time. Lastly, we would be curious to see how the analytical and prediction techniques used to study California's wildfires perform for other states, such as Colorado, and even for the wildfires in Australia.
California’s fire suppression data was also not granular enough, in that it did not detail where exactly the money was going. For the years in which we could compare fire frequency to the budget, the budget did not appear to have any direct effect on California’s wildfires. As a result, we would recommend that the State of California investigate how the money is being spent. More detailed data on how the budget is spent would also help optimize the allocation of funds across different suppression efforts.